CONTRAlign: Discriminative Training for Protein Sequence Alignment
Authors
Abstract
In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
Similar Resources
A max-margin model for efficient simultaneous alignment and folding of RNA sequences
Motivation: The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction. In this work, we present RAF (RNA Alignment and Folding), an efficient algorithm for ...
Discriminative Structured Models for Biological Sequence Analysis (a dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy)
Making predictions is a key element in many computational biology applications: given a set of input biological sequences, use an inference procedure to generate some corresponding predicted output. The prediction process involves defining an appropriate scoring model for comparing alternative output predictions, developing efficient inference algorithms for choosing high-scoring outputs, and c...
Training Protein Threading Models using Structural SVMs
Protein threading is the problem of inferring the structure of a protein from its sequence by matching the sequence against a set of known structures. Unlike conventional sequence to sequence alignment tasks, alignment models for threading can exploit a rich set of features derived from the geometry of the known structure. To make use of these complex and interdependent features, we explore the...
Discriminative Pruning for Discriminative ITG Alignment
While Inversion Transduction Grammar (ITG) has regained more and more attention in recent years, it still suffers from the major obstacle of speed. We propose a discriminative ITG pruning framework using Minimum Error Rate Training and various features from previous work on ITG alignment. Experiment results show that it is superior to all existing heuristics in ITG pruning. On top of the prunin...
Learning to Align Sequences: A Maximum-Margin Approach
We propose a discriminative method for learning the parameters of linear sequence alignment models from training examples. Compared to conventional generative approaches, the discriminative method is straightforward to use when operations (e.g. substitutions, deletions, insertions) and sequence elements are described by vectors of attributes. This admits learning flexible and more complex align...
Publication date: 2006